125 research outputs found

    Example-based machine translation of the Basque language

    Basque is both a minority and a highly inflected language with free order of sentence constituents. Machine Translation of Basque is thus both a real need and a test bed for MT techniques. In this paper, we present a modular Data-Driven MT system which includes different chunkers as well as chunk aligners which can deal with the free order of sentence constituents of Basque. We conducted Basque to English translation experiments, evaluated on a large corpus (270,000 sentence pairs). The experimental results show that our system significantly outperforms state-of-the-art approaches according to several common automatic evaluation metrics.

    Comparing rule-based and data-driven approaches to Spanish-to-Basque machine translation

    In this paper, we compare the rule-based and data-driven approaches in the context of Spanish-to-Basque Machine Translation. The rule-based system we consider has been developed specifically for Spanish-to-Basque machine translation, and is tuned to this language pair. By contrast, the data-driven system we use is generic, and has not been specifically designed to deal with Basque. Spanish-to-Basque Machine Translation is a challenge for data-driven approaches for at least two reasons. First, there is a lack of bilingual data on which a data-driven MT system can be trained. Second, Basque is a morphologically rich agglutinative language, and translating into Basque requires generating a large amount of morphological information, a difficult task for a generic system not specifically tuned to Basque. We present the results of a series of experiments, obtained on two different corpora, one being “in-domain” and the other one “out-of-domain” with respect to the data-driven system. We show that n-gram based automatic evaluation and edit-distance-based human evaluation yield two different sets of results. According to BLEU, the data-driven system outperforms the rule-based system on the in-domain data, while according to the human evaluation, the rule-based approach achieves higher scores for both corpora.

    The ADAPT system description for the IWSLT 2018 Basque to English translation task

    In this paper we present the ADAPT system built for the Basque to English Low Resource MT Evaluation Campaign. Basque is a low-resourced, morphologically rich language. This poses a challenge for Neural Machine Translation models, which usually achieve better performance when trained on large data sets. Accordingly, we used synthetic data to improve the translation quality produced by a model built using only authentic data. Our proposal uses back-translated data to: (a) create new sentences, so the system can be trained with more data; and (b) translate sentences that are close to the test set, so the model can be fine-tuned to the document to be translated.

    Patrixa: A unification-based parser for Basque and its application to the automatic analysis of verbs

    In this chapter we describe a computational grammar for Basque, and the first results obtained using it in the process of automatically acquiring subcategorization information about verbs and their associated sentence elements (arguments and adjuncts). In section 1 we describe the Basque syntax and the grammar we have developed for its treatment. The grammar is partial in the sense that it cannot recognize every sentence in real texts, but it is capable of describing the main syntactic elements, such as noun phrases (NPs), prepositional phrases (PPs), and subordinate and simple sentences. This can be useful for several applications. In section 2 we explain the syntactic analyzer (or parser) used to automatically acquire information on verbal subcategorization from texts. The results will later be used by a linguist or processed by statistical filters. This work has been done by the IXA Natural Language Processing research group, centered on the application of automatic methods to the analysis of Basque.

    Strategies to develop Language Technologies for Less-Resourced Languages based on the case of Basque

    Over 23 years, the IXA group has developed a basic set of resources, tools and applications for Basque, following an initial strategy that has been adapted in response to technological changes. We think that our strategy and experience can be a reference for other less-resourced languages. According to a six-level classification of world languages, we estimate that this strategy may be useful for several hundred languages: those that have developed a written standard but are still beginners in Human Language Technology.

    Teknologia garatzeko estrategiak baliabide urriko hizkuntzetarako: euskararen eta Ixa taldearen adibidea

    The article begins by presenting several data points on the situation of the Basque language, and then proposes a classification of the world's languages according to their presence on the Internet and in language technology. The body of the article presents the work done by the Ixa group in the field of automatic processing of Basque, identifying its seven main milestones and describing the strategy that has guided this development. It argues that this strategy can serve as a reference for 190 languages which, according to the proposed classification, have no language technology resources but do have a minimal significant presence on the Internet.

    Weba euskarazko corpus gisa

    The Basque language, just as any other, needs text corpora to survive in the modern world and to be used normally. But Basque corpora are few and small compared to those of major languages. This is because other languages have made use of the "Web-as-Corpus" approach, which consists of using the web as a corpus or as a source of texts for corpora. In this paper, we describe the research carried out by the first author in his PhD thesis, under the supervision of the other two authors, on using the web and automatic methods for Basque corpus building, along with the tools developed and the results obtained. From these we can conclude that the "Web-as-Corpus" approach is valid for improving the state of Basque corpora, since with the developed tools we have collected quality corpora of different types (very large general corpora, specialized corpora, comparable corpora, ...) and built a service to query the web as a Basque corpus. Many of these tools and services have already been placed online for public use.